-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11310: [Rust] implement JSON writer #9256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9256 +/- ##
==========================================
- Coverage 81.89% 81.87% -0.02%
==========================================
Files 215 216 +1
Lines 52988 53097 +109
==========================================
+ Hits 43392 43474 +82
- Misses 9596 9623 +27
Continue to review full report at Codecov.
|
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave this one a quick look, and it looks good so far. I think the nulls are not accounted for.
I left some comments accordingly.
| } | ||
|
|
||
| pub fn as_list_array<S: OffsetSizeTrait>(arr: &ArrayRef) -> &GenericListArray<S> { | ||
| pub fn as_generic_list_array<S: OffsetSizeTrait>(arr: &ArrayRef) -> &GenericListArray<S> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
breaking change to make naming more consistent with array type.
|
@jorgecarleitao @nevi-me ready for review. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, @houqp ! This looks super clean and IMO ready to go.
I would add a note to the readme where we discuss json readers. IMO this is a major feature!
nevi-me
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the implementation @houqp, minor comments
| struct_array_to_jsonmap_array(as_struct_array(array), array.len()); | ||
| jsonmaps.into_iter().map(Value::Object).collect() | ||
| } | ||
| _ => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there'd be demand for dictionary support, but we can add that separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree -- on both counts :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i was planning to add dictionary support, but this patch set grew larger than I expected, so I decided to leave that to a follow up PR for easier review.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick skim at this PR and it looks nice to me. Thank you @houqp
|
I plan to review this PR later today and merge unless I hear otherwise |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice work @houqp
…s newline delimited json streams
## Rationale
Currently the Arrow json writer makes JSON that looks like this (one record per line):
```json
{"foo":1}
{"bar":1}
```
Which is not technically valid JSON, which would look something like this:
```
[
{"foo":1},
{"bar":1}
]
```
## New Features
This PR parameterizes the JSON writer so it can write in either format. Note I needed this feature for in IOx, in influxdata/influxdb_iox#870, and I want to propose contributing it back here).
## Other Changes:
1. Added the function `into_inner()` to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. [BufReader::into_inner](https://doc.rust-lang.org/std/io/struct.BufReader.html#method.into_inner)
2. Per Rust standard pattern, I change the JSON writer so that it doesn't add any Buffering (via `BufReader`) itself, and instead allows the caller the choice of what type of buffering, if any, is needed.
3. Added / cleaned up a bunch of documentation and comments.
## Questions
I went with parameterizing the `Writer` output as a trait rather than runtime dispatch, for performance. This shouldn't have backwards compatible issues Given the writer has not yet been released yet (introduced by @houqp #9256)
However would people prefer a single `Writer` that took an `Options` struct or something to determine how it wrote out data?
Closes #9575 from alamb/alamb/json_arrays
Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…s newline delimited json streams
## Rationale
Currently the Arrow json writer makes JSON that looks like this (one record per line):
```json
{"foo":1}
{"bar":1}
```
Which is not technically valid JSON, which would look something like this:
```
[
{"foo":1},
{"bar":1}
]
```
## New Features
This PR parameterizes the JSON writer so it can write in either format. Note I needed this feature for in IOx, in influxdata/influxdb_iox#870, and I want to propose contributing it back here).
## Other Changes:
1. Added the function `into_inner()` to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. [BufReader::into_inner](https://doc.rust-lang.org/std/io/struct.BufReader.html#method.into_inner)
2. Per Rust standard pattern, I change the JSON writer so that it doesn't add any Buffering (via `BufReader`) itself, and instead allows the caller the choice of what type of buffering, if any, is needed.
3. Added / cleaned up a bunch of documentation and comments.
## Questions
I went with parameterizing the `Writer` output as a trait rather than runtime dispatch, for performance. This shouldn't have backwards compatible issues Given the writer has not yet been released yet (introduced by @houqp apache/arrow#9256)
However would people prefer a single `Writer` that took an `Options` struct or something to determine how it wrote out data?
Closes #9575 from alamb/alamb/json_arrays
Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Other than basic type support, this PR also covers the following cases: